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Abstract 

We  present  a  multithreaded  model  for  the  dynamic  load-balancing  of  numerical,  adap¬ 
tive  computations  required  for  the  solution  of  Partial  Differential  Equations  (PDEs)  on 
multiprocessors.  Multithreading  is  used  as  a  means  of  exploring  concurrency  at  the  proces¬ 
sor  level  in  order  to  tolerate  synchronization  costs  inherent  to  traditional  (non-threaded) 
parallel  adaptive  PDE  solvers.  Our  preUminary  analysis  for  parallel,  adaptive  PDE  solvers 
indicates  that  multithreading  can  be  used  as  a  mechanism  to  mask  overheads  required  for 
the  dynamic  balancing  of  processor  workloads  with  computations  required  for  the  actual 
numerical  solution  of  the  PDEs.  Also,  multithreading  can  simplify  the  implementation  of 
dynamic  load-balancing  algorithms,  a  task  that  is  very  difficult  for  traditional  data  parallel 
adaptive  PDE  computations.  Unfortunately,  multithreading  does  not  always  simphfy  pro¬ 
gram  complexity,  often  makes  code  re-usability  difficult,  and  increases  software  complexity. 
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supplemented),  by  the  National  Aeronautics  and  Space  Administration  under  NASA  Contract  No.  NASl-19480, 
while  the  author  was  in  residence  at  the  Institute  for  Computer  Applications  in  Science  and  Engineering,  (ICASE), 
NASA  Langley  Research  Center,  Hampton,  VA  23681  and  in  part  by  the  Cornell  Theory  Center. 
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1  Introduction 


One  of  the  difficulties  in  parallel  programming  is  attaining  good  performance  when  solving  prob¬ 
lems  with  irregular  and  dynamic  (adaptive)  computations.  Many  of  the  early  successes  of  parallel 
processing  were  obtained  on  relatively  regular  problems  (e.g.,  structured,  static  grids).  The  need 
to  solve  real-life  problems  increased  the  necessity  to  address  issues  related  to  the  parallelization 
of  irregular  and  adaptive  computations  such  as  adaptive  finite-element  computations  for  fluid 
flows  and  structures. 

Traditional  load-balancing  methods  for  non-threaded  computations  under-utilize  multiproces¬ 
sor  resources  such  as  CPU,  memory,  and  network,  since  the  load-balancing  is  carried  out  in 
sequential  phases  that  require  global  synchronizations  [1],  [2].  As  an  alternative  to  the  tra¬ 
ditional  load-balancing  methods  we  propose  a  multithreaded  model.  According  to  this  model 
processors  execute  many  computational  actions  or  threads  of  control;  each  thread  is  typically 
dependent  on  results  of  local  or  remote  threads.  Instead  of  scheduling  these  threads  in  a  static 
and  predefined  way,  we  allow  them  to  be  dynamically  scheduled  based  on  availability  of  the  data 
they  depend  on. 

Multithreading  here  is  used  as  a  mechanism  for  concurrently  executing  actions  or  tasks  required 
for  load-balancing — such  as  information  dissemination,  decision  making,  and  data  migration — 
and  tasks  required  for  the  computation  of  the  actual  application.  Multithreading  improves  CPU 
and  network  utilization  by  masking  the  inherent  synchronization  delays  involved  in  traditional 
non-threaded  load-balancing  methods.  We  prove  that  under  certain  values  of  the  parameters 
(i.e.,  number  of  threads,  and  context  switch  time)  of  the  model,  multithreaded  load-balancing 
systems  are  expected  to  perform  better  than  existing  non-threaded  load-balancing  software. 

The  rest  of  this  paper  is  organized  as  follows:  Section  2  provides  an  overview  of  existing  load¬ 
balancing  methods  and  a  brief  introduction  to  lightweight  threads.  In  Section  3  we  describe 
a  set  of  geometric  abstractions  that  we  use  in  the  rest  of  the  paper.  Section  4  outlines  the 
basic  principles  of  our  load-balancing  algorithm  based  on  the  multithreaded  model.  Section  5 
presents  a  comparison  of  the  load-balancing  algorithm  with  existing  direct  and  incremental  load¬ 
balancing  methods.  Finally,  in  Section  6  we  outline  a  number  of  advantages  and  disadvantages  of 
the  multithreaded  approach  for  parallel  numerical  computing  and  we  conclude  with  future  work 
and  directions. 
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2  Background 

Parallelism,  for  data-parallel  PDE  solvers,  is  achieved  by  decomposing  the  underlying  geometric 
(i.e.,  grids)  or  algebraic  (i.e.,  linear  systems  of  equations)  data  structures.  The  data  decomposi¬ 
tion  of  grids  or  sparse  coefficient  matrices  is  equivalent  to  a  graph  partitioning  problem  which  is 
an  NP-Complete  problem.  During  the  last  8  to  10  years  many  interesting  heuristics  have  been 
proposed  to  compute  sub-optimal  data  distributions  for  static  PDE  problems.  In  this  section  we 
first,  present  a  brief  overview  of  the  most  commonly  used  heuristics  and  introduce  basic  concepts 
of  threads  which  we  will  be  using  throughout  the  paper. 

2.1  Load-balancing:  an  overview 

The  time  it  takes  to  execute  a  program  on  a  multiprocessor  is  equal  to  the  time  it  takes  for  the 
slowest  (overloaded)  processor  to  complete  its  computation  (calculations  and  communication). 
Therefore,  the  objective  of  load-balancing  algorithms  and  software  is  to  minimize  the  execution 
time  of  the  slowest  processor.  Existing  load-balancing  algorithms  and  software  systems  assume 
no  overlapping  of  communication  with  calculations  and  balance  the  processors’^  workloads  by 
minimizing  the  following  objective  function  [48]; 

OF  =  max  {  W{m{Di))  +  ^  C{m{Di),m{Dj))  }  (1) 

where,  for  data-parallel  PDE  solvers,  7n  :  {A}^i  is  a  function  that  maps  the  grid 

points  (grids)  of  submesh  Di  to  processor  Pi  =  m(A);  W{m{Di))  is  the  computational  load  of 
the  processor  m{Di),  which  is  proportional  to  the  number  of  grids  in  A;  C{m{Di),m{Dj))  is 
the  communication  cost  required  between  the  processors  m{Di)  and  m{Dj)-,  and,  finally,  is 
the  set  of  submeshes  that  are  adjacent  to  A-  These  heuristics  are  classified  into  two  classes: 
direct  (or  clustering)  and  iterative  (or  incremental).  A  list  of  very  interesting  results  for  direct 
and  incremental  methods,  which  is  by  no  means  complete,  appears  in  [1],  [13],  [22],  [2.3],  [24], 
[25],  [40],  [41],  [42],  [43],  [44],  [45],  [46],  [48],  and  [49]. 

The  data- clustering  algorithms  are  based  on  grouping  mesh  points  into  clusters  such  that  the 
points  within  a  cluster  have  a  high  degree  of  “natural  association”  among  themselves,  while  the 
clusters  are  “relatively  distinct”  from  each  other.  In  our  case,  “natural  association”  is  expressed 

^We  assume  homogeneous  processors. 
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in  terms  of  the  locality  properties  of  the  finite  element  and  finite  difference  stencils  that  are  used 
to  approximate  a  continuous  PDE  operator.  The  “relative  distinction”  is  expressed  in  terms  of 
the  address  space  that  is  associated  with  the  unknowns  of  the  mesh  or  grid  points  of  the  same 
cluster.  Most  of  these  algorithms  are  computationally  expensive  and  very  successful  in  solving 
the  load-balancing  problem  for  static  PDE  computations  [1.3],  [47],  [48].  They  require  a  complete 
global  knowledge  of  the  graph  (mesh)  and,  therefore,  these  methods  are  not  suitable  for  adaptive 
methods  in  which  the  topology  and/or  geometry  of  the  mesh  change  any  time  h-refinement  is 
performed  throughout  the  PDE  solution  process.  In  addition,  some  of  these  methods  are  sensitive 
to  small  perturbations  (/i-refinement)  and  often  lead  to  heavy  data  migration  [4-3],  [44]. 

On  the  other  hand,  existing  incremental  (non-threaded)  methods  are  not  as  expensive  as 
clustering  methods.  Flaherty’s  group  at  RPI  have  shown  in  [1],  [49]  that  these  methods  are  very 
successful  in  load-balancing  the  computation  of  adaptive  PDE  methods  on  distributed-memory 
MIMD  machines.  Incremental  methods — as  is  true  for  direct  methods — decompose  the  parallel 
adaptive  PDE  computations  into  three  phases:  (i)  computation,  (ii)  balancing  and  (Hi)  data- 
migration.  The  computation  phase  corresponds  to  the  actual  computation  and  inter-processor 
communication  required  by  the  PDE  solver,  while  the  balancing  and  data  migration  phases 
correspond  to  the  calculation  and  inter-processor  communication  required  to  solve  and  enforce 
the  solution  of  the  minimization  problem  defined  by  equation  (1).  A  global  synchronization 
barrier  guarantees  that  all  processors  reach  the  balancing  and  data-migration  phases  at  the  same 
time  [1]. 

2.2  Threads 

Threads  were  used  for  many  years  in  disciplines  such  as  real-time  systems  and  distributed  op¬ 
erating  systems  very  successfully.  A  thread  is  an  independent  set  of  instructions  that  executes 
within  the  context  of  a  UNIX  process  (see  [8],  [9],  [10],  [11],  and  [12]).  Threads  in  multithreaded 
programs  run  logically  concurrently.  For  multicomputers,  the  decomposition  of  a  coarse-grained 
computation  into  finer  grained,  logically  concurrent  executable  threads  is  needed  for  three  rea¬ 
sons:  (1)  to  overlap  in  time  logically  separate  tasks  that  use  different  resources  (i.e.,  network, 
CPU,  disks),  (2)  to  simplify  parallel  programming,  and  (3)  to  load  balance  computations.  Typical 
examples  for  the  logically  concurrent  execution  of  threads  are  (i)  the  overlapping  of  calculation 
with  communication  [14],  and  (ii)  the  overlapping  of  load  balancing  and  actual  computation 
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phases  [15]. 

During  execution  each  thread  can  be  in  one  of  the  following  states:  new,  ready,  running, 
blocked,  dead.  The  state  of  a  thread  is  defined  by  its  current  activity.  When  a  thread  is  created 
it  is  given  a  function  to  run  and  it  is  set  to  a  new  state.  A  thread  in  the  new  state  consists  of 
various  data  structures  that  describe  its  context.  Once  the  data  structures  are  allocated  and  the 
thread  is  registei'ed  with  the  system,  the  thread  moves  to  the  ready  queue  and  its  state  is  set 
to  ready.  If  a  thread  is  selected  to  execute,  then  its  state  changes  to  running.  While  a  thread 
is  in  the  running  state  it  may  decide^  to  wait  for  a  condition  or  for  outstanding  receives  to  be 
signaled,  in  which  case  its  state  changes  to  blocked.  Finally,  after  a  thread  completes  its  execution 
or  decides  to  terminate  its  state,  it  changes  to  dead.  The  use  of  these  states  will  become  clear  in 
Section  4. 


Figure  1:  Thread  state  diagram. 

For  the  type  of  parallel  computations  we  address  in  this  paper  we  need  non-preemptive  schedul¬ 
ing  of  the  threads  (i.e.,  threads  run  to  completion  or  voluntarily  yield  the  CPU).  An  advantage 
of  such  a  scheduling  strategy  is  the  reduction  of  overhead  by  keeping  context-switches^  to  a  min¬ 
imum.  Threads  are  classified  into  heavy-  or  light-weight  threads  based  on  the  amount  of  context 
(weight)  that  needs  to  be  saved/restored  when  a  thread  is  removed  or  reinstated  from/to  the 

^In  the  case  of  non-preemptive  systems. 

^Switching  from  one  thread  to  another  requires  a  certain  amount  of  time  for  administration  (saving  and  loading 
registers  and  memory  maps,  and  updating  various  tables  and  lists). 
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CPU.  In  this  paper  we  use  light-weight  (or  user-defined)  threads. 

The  computation  of  multithreaded  data-parallel  programs  consists  of  two  phases,  namely, 
computation  and  thread-scheduling  phases.  During  the  computation  phase,  the  processor  con¬ 
currently  performs  the  operations  needed  to  execute  the  actual  computation:  (1)  it  initializes 
send/recv  operations,  (2)  it  polls  and  notifies  threads  with  outstanding  recvs,  (3)  it  performs 
actual  calculations,  and  (4)  it  provides  remote  service  requests  such  as  check/change  status  of 
remote  threads.  Finally,  during  the  thread-scheduling  phase  the  processor  does  the  bookkeeping 
required  for  the  administration  and  the  execution  of  threads.  This  is  an  additional  overhead  that 
does  not  appear  in  the  non-threaded,  data-parallel  paradigm. 

3  Basic  abstractions 

We  break  the  original  load-balancing  problem  into  many  simpler  problems  by  defining  a  hierar¬ 
chy  of  geometric  abstractions:  domains,  blocks,  subdomains  and  regions.  Regions  are  mapped 
to  scalar  objects  called  threads.  Threads  execute  in  a  loosely  synchronous  mode.  Based  on 
computation  and  synchronization  requirements,  threads  are  grouped  into  distributed  objects: 
strings,  ropes  and  nets.  Threads  that  correspond  to  regions  of  the  same  subdomain  belong  in 
the  same  string  (see  Figure  2).  Threads  that  belong  in  the  same  string  execute  the  same  code 
on  different  data  (SPMD  model).  Strings  that  correspond  to  the  subdomains  of  the  same  block 
belong  in  the  same  rope.  Threads  on  different  ropes  might  compute  the  solution  for  different 
PDFs,  use  different  grid  types,  apply  different  solution  methods,  and  they  may  therefore  have 
different  computational  requirements  and  synchronization  points  (MPMD  model).  Finally,  ropes 
that  correspond  to  blocks  of  the  same  domain  and  handle  the  computation  associated  with  the 
same  application  belong  in  the  same  net  (see  Figure  2). 

Processor  workload  is  balanced  by:  (i)  migrating  threads  from  overloaded  processors  to  under¬ 
loaded  ones  that  handle  strings  from  the  same  rope,  and  (ii)  by  migrating  strings  from  overloaded 
processors  to  underloaded  ones  that  handle  ropes  from  the  same  net.  The  thread  and  string  mi¬ 
gration  is  non-preemptive  and,  therefore,  instead  of  thread  migration  we  perform  data-migration. 
The  data  is  migrated  so  that  subsequent  communication  for  the  actual  parallel  computation  of 
the  PDF  solver  is  minimized. 

For  each  subdomain,  for  example  Sj^Si  of  Figure  3,  we  identify  two  types  of  regions:  interface 
regions  and  interior  regions.  A  region  is  considered  to  be  interface  if  there  is  a  grid  point  in  the 
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Figure  2:  Geometric  and  parallel  abstractions. 


region  that  has  at  least  one  of  its  adjacent  neighbor  points  residing  in  a  different  context  (tradi¬ 
tionally  non-local  memory).  Non-interface  regions  are  considered  to  be  interior.  Computations 
on  interior  regions  of  different  subdomains  can  be  performed  independently.  For  each  interface 
or  interior  region  we  associate  (create)  one  thready  t  (see  Figure  3).  The  size,  \t\,  of  a  thread  is 
analogous  to  the  number  of  grid  points  of  the  corresponding  region — many  times  in  this  paper 
we  denote  by  t  the  mesh  that  corresponds  to  the  thread  t.  All  threads  for  h-refinement  PDF 
methods  are  of  the  same  size.  The  size  of  a  thread  can  change  during  the  computation  in  order  to 
achieve  better  balancing  of  processors’  work-loads  (i.e.,  each  thread  can  be  split  into  two  or  more 
threads,  depending  on  the  required  load-balancing  resolution;  such  a  resolution  can  be  achieved 
within  a  small  number — order  of  log2  tt|  or  log4  |<| —  of  iterations  compared  to  the  number  of 
iterations  required  by  incremental  load-balancing  algorithms  [1]). 

Interior  threads  execute  exclusively  on  data  residing  in  the  memory  of  the  processor  on  which 
the  threads  execute,  while  interface  threads  require  the  access  of  non-local  data.  Threads  cor¬ 
responding  to  regions  of  the  same  subdomain  belong  in  the  same  process  (context)  and  com¬ 
municate  using  the  shared-memory  (user-address  space  for  the  process)  model,  while  interface 
threads  that  correspond  to  regions  from  different  subdomains  communicate  through  message 
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Figure  3:  Left)  Block,  Bi,  in  middle  row  second  from  the  left  of  Figure  2;  Bi  is  partitioned  into 
four  subdomains,  Right)  Global  thread  graph  and  its  partition  into  four  subgraphs 

that  correspond  to  {Si,Bi}i=i-  Threads  with  data  dependencies  (edges)  that  cross  the  internal 
boundary  of  the  subdomains  are  interface  threads,  otherwise  they  are  interior  threads. 

passing  (for  more  details,  see  [17]).  Two  types  of  local  communication  are  identified:  inter-block 
(or  inter-group)  and  intra-block  (or  intra-group).  Network  traffic  can  be  reduced  by  using  a 
scheduling  communication  mechanism  between  threads  with  different  types  of  data  flow  needs. 
Such  schedulings  can  be  achieved  by  grouping  threads  of  the  same  block  into  ropes  [37],  [27],  [28]. 
Ropes  can  use  group  synchronization  and  collective  communication  mechanisms  and  more  than 
one  rope  can  share  resources  such  as  CPU,  network,  and  memory.  For  more  details  on  ropes,  see 
[27]  and  [37]. 

4  Multithreaded  approach  for  load-balancing 

In  this  section  we  describe  a  multithreaded  model  for  developing  load-balancing  (sharing)  algo¬ 
rithms.  The  traditional  non-threaded  approach  for  load-balancing  of  PDE  computations  leads 
to  (1)  under-utilization  of  multiprocessor  resources  such  as  the  CPU  and  network  and  (2)  in 
some  cases  intensification  of  problems  like  network  contention  — due  to  the  fact  that  all  pro¬ 
cessors  perform  data  migration  simultaneously.  In  this  section  we  propose  a  new  approach  for 
load-balancing  that  explores  concurrency  within  the  processor  in  order  to  maximize  utilization  of 
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multiprocessor  resources  without  sacrificing  program  complexity.  Our  approach,  in  contrast  to 
the  direct  and  incremental,  non-threaded,  load-balancing  methods,  attempts  to  ensure  that  no 
processor  is  waiting  idle  while  more  than  one  thread  remains  to  be  executed  on  other  processors. 
Each  processor,  when  the  need  arises,  requests  work  (threads)  from  a  subset  (neighborhood)  of 
processors  that  are  overloaded  or  slow. 

Independent  of  the  approach  (traditional  or  multithreaded)  we  use  to  load-balance  PDE  com¬ 
putations,  we  should  focus  on  three  fundamental  issues,  namely;  (1)  Tnemory  latency,  (2)  syn¬ 
chronization  cost,  and  (3)  convergence  rate.  In  this  paper  we  address  the  memory  latency  and 
synchronization  cost;  it  is  difficult  to  uncouple  these  issues  and  therefore  we  have  to  consider  the 
convergence  rate  of  PDE  solvers  whenever  we  deal  with  message  passing  and  scheduling  latencies. 
In  the  rest  of  the  paper  we  address  issues  related  to  convergence  rate  of  PDE  solvers  only  when 
it  is  absolutely  necessary. 

4.1  Memory  latency 

Parallel  computers  introduce  a  new  level  in  the  storage  hierarchy;  in  addition  to  registers,  cache 
and  memory,  there  is  remote  memory  that  is  accessed  across  an  interconnection  network.  In 
this  paper  our  objective  is  to  distribute  the  computation  and  data  so  that  we  not  only  balance 
processors’  workloads  but  also  minimize  overheads  due  to  message  passing  (i.e.,  memory  laten¬ 
cies).  In  order  to  achieve  our  objective:  (1)  we  minimize  the  access  of  non-local  unknowns  by 
minimizing  the  number  of  grid  points  that  reside  on  the  interfaces  of  the  subdomains  (see  Figure 
3)  and  (2)  we  mask  memory  latencies  due  to  access  of  non-local  data  by  overlapping  calculations 
with  communication. 

For  multithreaded  parallel  PDE  computations  we  identify  two  types  of  communications:  first, 
the  communication  between  interior  and  interface  threads  that  reside  in  the  same  context  (local 
threads),  and  second,  the  communication  between  interface  threads  that  belong  to  different 
contexts  (remote  threads).  The  efficient  communication  of  interior  and  interface  threads  is  critical 
for  the  overall  performance  and  success  of  the  multithreaded  approach.  Next  we  briefly  describe 
the  different  communication  mechanisms  between  local  and  remote  threads.  For  more  details  on 
the  implementation  of  these  mechanisms,  see  [35],  [36],  [37],  [20],  [38],  and  [39]. 

The  communication  between  local  threads  can  be  implemented  using  one  of  the  following 
three  approaches:  (1)  message-passing,  (2)  shared  variables,  or  (3)  W-buffer  scheme  [35].  An 
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advantage  of  the  first  approach  is  that  it  treats  the  communication  of  both  interior  and  interface 
threads  in  a  uniform  way,  thereby  simplifying  programming.  A  disadvantage  is  that  its  imple¬ 
mentation  (with  some  exceptions,  e.g.,  Mach)  requires  copies  of  thread-specific  data-structures 
from  one  thread  to  another;  recall  that  (a)  the  data-structures  reside  in  the  same  user-address 
space  and  (b)  memory  copy  operations  introduce  an  additional  overhead  that  does  not  appear 
in  the  non-threaded  approach.  The  second  approach  eliminates  additional  copy  operations  by 
using  shared  global  data-structures.  A  disadvantage  of  this  approach  is  that  it  requires  the  use 
of  synchronization  mechanisms  (mutexes)  to  protect  unwanted  reads/writes  from/to  shared  vari¬ 
ables.  The  implementation  of  this  approach  increases  program  complexity  and  requires  drastic 
code  restructuring  of  existing  non-threaded  programs. 

Finally,  the  third  approach  we  present  in  [35]  is  a  communication  scheme  specific  to  PDF 
computations^  and  is  based  on  W  (>  2)  copies  of  the  shared- variables.  The  idea  of  this  scheme 
is  to  use  “rondeau”  memory  locations  in  such  a  way  that  a  thread  (destination  in  the  case  of 
message  passing)  always  reads  the  correct  values  that  its  partner  (source)  just  wrote.  Read  and 
write  operations  between  threads  that  share  these  copies  of  variables  are  interleaved  (odd/even 
— mod  W —  iterations  in  the  case  W  =  2  )  in  a  way  that  the  overwriting  (by  one  thread)  of  useful 
values  (to  another  thread)  is  prevented.  Therefore,  this  scheme  preserves  the  integrity  of  shared 
variables  without  substantially  increasing  program  complexity  (in  the  case  of  non-copy  shared- 
variables)  and  without  introducing  overheads  of  unnecessary  and  expensive  copy  operations  (in 
the  case  of  message  passing)  on  local  data  structures.  This  approach  can  be  implemented  on  top 
of  the  existing  non-threaded  data-parallel  codes  with  minimum  modifications.  A  disadvantage 
of  this  communication  scheme  is  that  the  storage  complexity  is  increased  by  O  •  ■y/Rj)  • 

The  communication  between  remote  threads  (i.e.,  threads  that  reside  in  the  memory  of  different 
system  processors)  can  be  implemented  on  top  of  existing  message-passing  such  as  MPI  [19],  p4 
[29],  or  PVM  [30].  Unfortunately,  most  of  the  currently  available  message-passing  software  do 
not  provide  support  for  sending/receiving  messages  to  (from)  a  specific  entity  (function)  of  a 
process.  To  provide  thread-to-thread  communication  mechanism  we  use  an  idea  similar  to  the 
Active  Messages  (AM)  which  is  described  next.  For  more  details  on  the  implementation  of  the 
thread-to-thread  communication  mechanism,  see  [36] ,  [37] ,  [20]  and  [39] . 

An  interface  thread  that  executes  for  the  first  time  sends  its  messages  to  other  threads,  then 

However  it  can  be  generalized  for  many  other  similar  computations. 
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posts  all  its  receives^  and  voluntarily  yields  the  CPU  to  the  dispatcher.  The  dispatcher  does 
the  proper  bookkeeping  and  schedules  the  next  interface  thread — if  any.  After  all  the  interface 
threads  from  the  ready  queue  are  exhausted  the  dispatcher  schedules  the  first  interior  thread. 
Interior  threads  require  only  local  data  and  therefore  execute  until  completion.  This  process  is 
repeated  until  all  interior  threads  are  also  exhausted  from  the  ready  queue. 

During  the  time  interior  threads  perform  their  own  computations,  non-local  data  is  arriving 
from  the  network  (see  Figure  4).  The  non-local  data  is  stored  in  user-address  space  and  in 
memory  locations  that  the  user  provides  to  the  OS.  In  [36]  we  describe  a  mechanism  where  a 
specific  function  (message-handler)  is  activated  (on  message  arrival  the  process  is  interrupted  by 
a  hardware  signal)  and  performs  the  following  operations: 

•  Parses  the  header  of  the  message  and  identifies  the  destination  thread. 

•  Decrements  a  thread  counter  that  indicates  the  number  of  outstanding  receives  that  corre¬ 
spond  to  this  thread;  if  the  counter  of  outstanding  receives  becomes  zero  then  the  state  of 
the  thread  changes  from  blocked  to  ready,  and  the  thread  moves  from  the  blocked  stack  to 
the  ready  stack  (see  Figure  4).  At  this  point,  interface  threads  have  all  the  data  (local  and 
non-local)  they  need  to  perform  their  computations.  Notice  that  while  interface  threads  are 
waiting  for  incoming  messages  (through  the  network),  the  CPU  is  utilized  by  the  interior 
threads,  and  thus  calculations  and  communication  are  overlapped. 

4.2  Synchronization  cost 

Load-balancing  operation  is  a  special  case  of  the  producer-consumer  operation.  Consumer- 
producer  operations  like  forks  and  joins  and  mutual  exclusion  in  parallel  programming  require 
synchronization.  In  parallel  adaptive  PDF  computations  the  synchronization  cost  appears  in 
the  form  of  waiting  time  due  to  unbalanced  processor  workloads.  Our  objective  is  to  minimize 
this  time  and  mask  if  possible  inherent  delays  involved  in  the  traditional  load-balancing  methods 
without  increasing  program  complexity.  Our  approach,  in  contrast  to  the  direct  and  incremental 
non-threaded  load-balancing  methods,  attempts  to  ensure  that  no  processor  is  idle  while  more 

•  Hi  a  non-blocking  receive  operation  is  available,  it  first  posts  its  receives  and  then  sends  its  messages  to  other 
threads,  which  increases  the  probability  of  saving  a  local  copy  of  the  message  from  the  system  buffer  to  user 
address  space. 


10 


Figure  4:  Overlapping  of  computation  with  communication;  while  interface  threads  are  blocked 
in  the  blocked  stack  waiting  for  the  arrival  of  non-local  data,  interior  threads — from  the  ready 
stack — utilize  the  CPU,  performing  computations  on  local  data. 

than  one  thread  remains  to  be  executed  on  other  processors.  Each  processor,  when  the  need 
arises,  requests  work  (threads)  from  a  subset  (neighborhood)  of  processors  that  are  overloaded 
or  slow. 

Let  us  assume  that  our  processors  are  homogeneous  and  our  PDE  solver  does  not  share  the  re¬ 
sources  of  the  system  with  any  other  application.  We  define  as  computational  graph  Gc{Ec,  Vc), 
whose  vertices,  u,-,  correspond  to  the  submeshes  Di,  and  the  edges  Cjj  connect  two  vertices  (u,,  vj) 
if  Di  n  Dj  ^  The  weights,  le,-,  on  the  vertices  Vi  of  the  graph  correspond  to  the  compu¬ 
tation  associated  with  the  submesh  Di  and  are  analogous  to  the  number  of  mesh  points,  \Di\. 
Figure  5-right  depicts  the  computational  graph  of  the  refined  mesh  (Figure  5-left);  the  weights 
Wi  indicate  the  number  of  threads  per  subdomains  (i.e.,  context  or  processor). 

During  the  computation  of  adaptive  PDE  methods,  the  mesh  is  refined  in  areas  where  the  res¬ 
olution  of  the  solution  is  larger  than  a  given  tolerance  (see  Figure  5).  After  the  mesh  refinement 
is  completed,  new  threads  are  created  (or  old  ones  are  destroyed)  at  runtime.  All  threads  are 
approximately  of  the  same  size  (/^-refinement).  Processor  computation  is  balanced  by  migrating 
interface  threads.  The  thread  migration  is  non-preemptive  (i.e.,  threads  migrate  before  they 
start  execution)  and,  therefore,  instead  of  thread  migration,  (save-and-migrate  threads  context. 
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Figure  5;  Left)  h-refinement  and  a  16-way  partition  of  the  block  Bi  of  domain  defined  in  Figure 
2.  Right)  Computational  graph  with  uneven  number  of  threads  per  subdomain  (i.e.,  context  or 
processor). 

registers- values,  etc.)  we  perform  data- migration  of  mesh-points  only.  The  data  is  migrated  so 
that  subsequent  communication  for  the  actual  parallel  computation  of  the  PDE  solver  is  mini¬ 
mized.  In  contrast  to  traditional  incremental  methods,  load  sharing  of  processors  is  completed 
within  the  first  iteration  (i.e.,  before  any  global  reduction  operation  is  required  for  error  checking 
or  update  of  global  variables). 

The  policy  for  thread  migration  is  based  on  a  consumer-initiated  consumer/producer  (C/P) 
paradigm.  That  is,  every  processor  Pi  =  m{Di)  (consumer),  after  it  completes  its  computation 
(when  counter  of  ready  and  blocked  stacks  is  zero),  searches  its  neighborhood 

N{PiJ)  =  {Pj,Pj  =  m{Dj)  and  Dj  €  N{Di,l)  J  =  I,...,  diameter{Gc)  } 

to  identify  neighboring  processors  that  are  overloaded.®  Since  our  model  attempts  to  assure  that 
no  processor  remains  idle,  the  consumer  sends  interrupt-driven  messages  to  its  neighbors  (see 
[.36]  for  implementation  details)  and  requests  the  migration  of  one  or  more  threads. 

After  an  overloaded  processor  Pj  G  N{Pi,  1)  is  identified,  P,  (producer)  interrupts  its  computa- 

®Due  to  changes  in  the  demand  for  computation,  in  the  case  of  adaptive  methods,  or  due  to  external  loads  of 
the  processors  in  the  case  of  time-sharing  heterogeneous  workstations. 
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Figure  6:  Threads  T4  migrates  from  processor  Pq  to  processor  Pi;  T4  has  most  of  its  data 
dependencies  with  the  threads  of  processor  Pi.  The  thread  migration  is  based  on  the  principle 
of  minimizing  the  communication  of  the  actual  computation. 


tion  and  sends  a  thread  (data)  to  p.  The  thread  that  is  migrated  from  the  producer  to  consumer 
is  likely  to  have  data  dependencies  with  other  threads  that  already  reside  on  the  consumer’s  side. 
The  producer’s  decision  to  migrate  an  interface  thread,  U,  is  based  on  priorities,  n<.,  that  are 
computed  at  runtime  using  the  following  equation: 


Uti  =  (1 


where 

1  if  m(t,)  ^  m(tj) 

and  Nt-  is  the  set  of  all  threads  tj  with  data  dependencies  on  U.  Figure  6  depicts  the  different 
priorities  for  an  interface  thread  T4  of  processor  Pq. 

A  consumer  (processor)  might  get  data  from  more  than  one  producer.  In  this  case  it  creates 
and  executes  remote  threads  one  at  a  time,  using  a  FIFO  policy.  Before  a  remote  thread, 
is  created  by  the  consumer,  the  consumer  uses  a  remote  service  request,  Check_Thread_State, 
to  find  out  the  current  state  of  thread  Umt  on  the  processor  (producer)  from  which  thread  Umt 
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has  come  from.  If  the  current  state  of  trmt  thread  is  rtddy^  then  trmt  i®  created  and  scheduled 
for  execution  on  the  consumer;  its  state  on  the  producer  changes  from  ready  to  dead.  The 
Check_Thread_State  is  also  an  interrupt-driven  RSR  (see  [36]  for  implementation  details).  Notice 
that  in  contrast  to  existing  load-balancing  methods,  no  global  synchronization  is  required  for 
load  sharing  between  producer  and  consumer  processors.  Also,  the  consumer  performs  most  of 
the  computation  required  for  load  sharing  decision.  Since  underloaded  processor  (consumers)  are 
anyway  idle,  we  can  use  them  for  the  extra  work  required  for  solving  the  load-balancing  problem. 
Also,  notice  that  we  use  interrupt-driven  remote  service  requests  so  that  the  consumer  can  get 
data  and  schedule  as  soon  as  possible  the  remote  threads  that  have  migrated  to  its  context. 
Existing  load-balancing  methods  are  based  on  global  synchronization  barriers  that  have  to  be 
reached  by  all  processors.  In  the  case  of  non-threaded  incremental  methods,  this  implies  that 
some  processors  have  to  wait,  since  the  processors’  loads  are  balanced  gradually. 


5  Analysis 

In  this  section  we  compare  the  multithreaded  approach  with  traditional  incremental  methods 
only,  since  direct  methods  are  very  expensive  to  be  used  for  the  load  balancing  of  parallel  adaptive 
computations.  Consider  the  computation  graph  of  Figure  5  and  let  Tst  be  the  total  execution 
time  required  by  the  PDE  solver — whose  computation  is  balanced  by  an  incremental  algorithm — 
that  performs  N  iterations  until  the  next  mesh  refinement  occurs.  An  incremental  method  will 
balance  the  computation  in  K  iterations.  For  each  iteration  i,  with  1  <  z  <  A",  let  denote 

the  maximum  workload  oyer  all  processors.  Let  Tsib  be  the  summation  of  time  needed  for  the 
decision  making,  communication,  and  packing  data  to  be  migrated  from  an  overloaded  processor 
to  an  underloaded  one.  Once  the  processors  decide  on  the  data  to  be  migrated,  they  send/receive 
the  additional  data,  we  denote  this  time  as  Tmigr  • 

Taking  into  account  that  the  slowest  (most  overloaded)  processor  dominates  the  execution 
time  at  each  iteration,  we  can  compute  the  total  execution  time  between  two  mesh  refinements 
by: 

Tst  =  ^  ^max  A  (A  ~  K)^max  +  A  '  {Tslb  +  Tmigr)  (2) 

i=l 

Let  +  Ai  where  A*-  depends  on  the  application  and  effectiveness  of  the  incremental 
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algorithm  and  let  C\  —  where  A,-,  and  subsequently  Ci,  are  relatively  large  constants/ 

Then  (2)  can  be  written  as  : 

T.,  =  K.W^  +  C,  +  N-  -  K  ■  +  K  ■  (Tm  +  w)  ^ 

T.,  =  N-WS„  +  K-(T.,i  +  T„i„)  +  Ci  (3) 

Now,  consider  again  the  same  computation  (  Figure  5)  as  above  and  let  T^t  be  the  total  execu¬ 
tion  time  required  by  the  PDE  solver — whose  computation  is  load  balanced  by  a  multithreaded 
load-sharing  algorithm — that  performs  N  iterations  until  the  next  mesh  refinement  occurs.  Then 
Tmt  is  equal  to  ; 

N 

Tmt  =  •  Tctxi)  +  L  ■  {Tmlb  +  Tput) 

Tmt  =  N  •  +  N  •  Nf  Tctxt  +  L  •  {Tmlb  +  Tput)  (4) 

where  is  the  average  load  using  threads,  L  is  the  number  of  times  the  slowest  processor  has 
to  migrate  data,  and  Nt  is  the  maximum  number  of  threads.  Since  we  adjust  (reduce)  the  size 
of  the  threads  in  order  to  get  better  load  resolution  (unit-wise,  where  a  unit  can  be  an  element 
for  FEM  or  a  grid  point  for  FD),  we  can  say  that  after  a  small  number  of  iterations,  M,  that 
depends  on  the  size  of  the  thread,  we  can  have  ~  Wavr,  which  is  very  close  to  perfect  load 
balance,  unit- wise.  For  example,  for  threads  with  lOO’s  of  units  (elements  of  grid  points),  by 
reducing  the  thread  size,  |t|,  each  time  by  half,  we  can  achieve  perfect  balance  in  fewer  than  ten 
iterations.  Also,  let  Si  —  -  Wavr  for  i  <  M  then  Co  =  is  a  small  constant.  Then  (4) 

can  be  written  as  ; 


Tmt  —  A  •  WavT  N  ■  Nt  Tctxt  +  T  ■  {Tmlb  +  Tput)  +  Co 


(5) 


From  equations  (3)  and  (5)  and  the  fact  that  =  Waw  +  ct  •  tunit  (usually  a  »  1  [1]), 

we  see  that  a  multithreaded  load-sharing  algorithm  can  be  more  efficient  than  any  non-threaded 
incremental  algorithm  if: 


Nt< 


N-{W^ux-Waur)  +  K-{Ts,b  +  Tmigr  )  +  C*2  —  T  •  {Tmlb  +  Tput) 

N  •  Tctxt 


(6) 


For  light-weight  threads,  Tctxt  as  well  as  Tmib  and  Tput  is  of  an  order  of  tens  of  micro-seconds 
[32] .  Therefore  N ■  Tctxt  and  L  •  {Tmib  +  Tput)  are  very  small  numbers  compared  to  K •  {T^ib  +  Tmigr), 

^For  the  examples  that  appear  in  [50],  A,-  varies  from  10  to  50  units,  and  K  varies  from  a  few  tens  of  iterations 
to  a  few  100 ’s  of  iterations. 
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N  -a-  iuniu  and  C2  =  {Ct  -  Co)  ®  [1]  that  are  in  the  order  of  seconds.  Therefore,  theoretically, 
a  light-weight  multithread  system  with  a  reasonably  large  number  of  threads,  Nt,  is  capable  of 
improving  the  performance  of  parallel  adaptive  PDE  methods  even  further.  A  careful  and  very 
efficient  implementation  of  such  a  model  will  be  able  to  realize  the  above  expectations.  This 
is  easier  when  high-order  schemes  are  used  or  problems  with  many  degrees  of  freedom  per  grid 
point  since  —  Waw  >>  1- 

6  Discussion  -  Conclusions 

Existing  load-balancing  algorithms  require  that  all  processors  enter  the  balancing  phase  at  the 
same  time — guaranteed  by  global  synchronization  barriers.  This  requirement  leads  to:  (1)  the 
under-utilization  of  resources  such  as  the  CPU  and  network  because  many  processors  may  wait 
until  the  overloaded  or  slow  processors  reach  the  global  barrier,  and  (2)  the  intensification  of 
problems  like  network  contention  due  to  exclusive  use  either  of  the  network  (data-migration) 
or  of  the  CPU  (decision-making).  Concurrent  execution  of  tasks  required  for  load-balancing 
with  tasks  required  for  the  actual  computation  is  the  key  ingredient  for  developing  efficient  load¬ 
balancing  algorithms.  Here  threads  are  used  as  a  mechanism  to  explore  concurrency  at  the 
processor  level  in  order  to  tolerate  memory  latency  and  mask  synchronization  costs  inherent  in 
traditional  load-balancing  methods.  It  is  important  that  threads  tolerate  memory  and  scheduling 
latencies  without  sacrificing  program  simplicity  and  portability. 

Our  preliminary  experimental  data  using  CHANT  [37]  and  NEXUS  [20]  indicate  that  for  up  to 
64  threads,  the  overhead  introduced  due  to  context  switch  is  very  small  compared  to  the  execution 
time  required  for  the  computations  required  for  the  numerical  solution  of  the  PDE.  Moreover, 
for  applications  with  a  large  number  of  coarse-grain  threads  (up  to  a  point)  can  minimize  cache 
misses  and  improve  performance  (of  course  the  same  performance  can  be  achieved  by  the  re¬ 
structuring  of  non-threaded  programs  — an  error  prone  process).  Also,  our  preliminary  data 
indicates  that  with  up  to  32  threads  on  SP2  one  can  overlap  computation  with  communication 
and  improve  processor  and  network  utilization. 

Finally,  further  research  is  required  in  a  number  of  directions:  (1)  evaluation  of  multithreaded 
approach  with  respect  to  numerical  stability  of  PDE  solvers,  (2)  evaluation  and  calibration  of 

®Co  <  Cl,  since  M  usually  is  of  the  order  of  10  (/o^2l00),  while  K  can  be  from  a  few  tens  to  a  few  hundreds 
of  iterations  [1]  and  6,  decreases  each  time  by  half,  while  A,  can  be  between  a  few  units  to  tens  of  units. 
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the  model  using  real  applications  -  we  investigate  the  use  of  this  approach  for  dealing  with 
load  balancing  problems  in  numerical  relativity  codes  [55],  (3)  generalization  of  the  model  to 
handle  hp-refinement  methods,  (4)  development  of  transformation  mechanisms  for  converting 
off-the-shelf  vast  number  FORTRAN  libraries  for  PDFs  from  non-threaded  to  multithreaded 
programming  paradigm,  (5)  creation  of  multithreaded  runtime  support  systems  for  parallel  com¬ 
pilers  -we  investigate  this  within  the  Bernoulli  compiler  [56] ,  and  problems  solving  environments 
-we  investigate  this  within  the  Parallel  ELLPACK  environment  [57]. 
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